Image-audio transformation system, particularly as a visual aid for the blind.

Device for converting visual images into sound sequences, particularly to be used as a visual aid for the
blind, operating on fast-working, pipelined, extended parallel operation electronics.

An image is scanned in an orthogonal pattern by means of a low-weight, low-energy-consuming electronic
sensing system, thus providing a portable visual aid for the patient.

IMAGE-AUDIO TRANSFORMATION SYSTEM



The invention relates to a transformation system for converting visual images into acoustical
representations. An article by L. Kay in IEE Proceedings, Vol. 131, No. 7, September 1984, pp. 559-576,
reviews several mobility aids for totally blind or severely visually handicapped persons. With some of these
aids, e.g. the laser cane, visual information is converted into acoustical representations, but the amount
of visual information conveyed is very low. In fact, these systems are mainly electronic analogues or
extensions of the ordinary (long) cane, as they are obstacle detectors for the single direction pointed at.
Direct stimulation of the visual cortex has also been tried, but up to now with only poor success. The need
for brain surgery is an obvious obstacle to the development of this approach. Another possibility is to
display an image by a matrix of tactile stimulators, using vibrotactile or electrocutaneous stimulation. The
poor resolution of present modest-sized matrices may be a major reason for a lack of success in mobility.

Another approach mentioned in the Kay article, called sonar, relies on acoustical representations. With
this approach the problem of ambiguity arises, because very different configurations of obstacles may
conceivably yield almost the same acoustic patterns. Another problem is that the complexity of an acoustic
reflection pattern makes it very hard to interpret, so that extensive training is required. Here too the
spatial resolution is rather low due to a far from optimal exploitation of the available bandwidth. The range
of sight is rather restricted in general.

It is an object of the invention to provide a transformation system in which the restrictions of known
devices are avoided to a large extent. To this end a transformation system as recited in the preamble is, in
accordance with the invention, characterized in that visual images are transformed on-line into acoustical
representations substantially retaining the relevant information content of the visual images.
The invention is based on the following basic considerations. 

- True visual input seems the most attractive, because it is known to provide an adequate description of the
environment without distance limitations. Therefore a camera should be used as a source of data to be
converted.

- An acoustic representation will be used, because the human hearing system probably has, after the
human visual system, the greatest available bandwidth. The greater the bandwidth, the larger the amount of
information that can be transmitted in any given time interval. Furthermore, the human hearing system is
known to be capable of processing and interpreting very complicated information, e.g. speech in a noisy
environment, considering the flexibility of the human brain.

- The mapping of an image into sound should be extremely simple from the viewpoint of a blind person. It
must be understood, at least for simple pictures in the beginning, by any normal human being without much
training and effort. This reduces psychological barriers.

- A scanline approach will be used to distribute an image in time. Thus a much higher resolution for a given
bandwidth is obtained. The time needed to transfer a single image must remain modest to make sure the
entire image is grasped by short-term human memory for further interpretation. The transfer time must also
be as small as possible, so that an image is refreshed as often as possible and the representation of
reality stays up to date, for instance for moving objects. An upper limit to the conversion time will
therefore be on the order of a few seconds.

- The image-to-sound conversion must occur in real time to be useful.

- The system should be portable, low-power and low-cost to be suitable for private use in battery-operated
mobility applications.

A preferred embodiment of the invention is provided with a low-weight (portable), low-power (battery-fed)
image sensing system, preferably incorporating digital data processing, such as a CCD camera.

In a further preferred embodiment an image processing unit is based on orthogonally scanning a grid of
pixels, each pixel having one of a range of discrete brightness values. Such a grid may be built up of, for
example, 64 x 64 pixels with, for example, sixteen possible brightness levels for every pixel.

A further embodiment comprises pipelined digital data processing for converting digitized image 
information into acoustical representation. Preferably a high degree of parallel processing is used in order to 
improve converting speed. 

A transformation system according to the invention is preferably suited to read out an image scanline-wise,
for example in 64 scanlines with 64 pixels on each scanline. Preferably a scanline position is
represented in time sequence and a pixel position is represented in frequency, or vice versa, whilst the
brightness of a pixel is represented in the amplitude of an acoustical representation.

Some embodiments according to the invention will be described in more detail with reference to the
drawing. The drawing shows:
Figure 1 an example to illustrate the basic principles of the transformations performed by the invention,
Figure 2 a block diagram of a system to perform transformations analogous to the transformations of
Figure 1,
Figure 3 an example to illustrate the design and operation of a waveform synthesis stage within the
image processing unit of Figure 2,
Figure 4 an illustration showing more detail of a design of the system, in particular the image processing
unit, of Figure 2,
Figure 5 an illustration showing more details of the control logic of the image processing unit of Figure 4,
Figure 6 an illustration of a Gray-code analog-to-digital converter for the image processing unit of
Figure 4,
Figures 7 and 8 an illustration of an analog output stage for the image processing unit of Figure 4.

In order to put the description of the invention in perspective, it seems useful to illustrate with a strongly
simplified example, as given in Figure 1, the principles of the way the invention transforms visual images
into acoustical representations. The particular orientations and directions indicated in Figure 1 and
described below are not essential to the invention. For example, the transformation may be reconfigured for
scanning from top to bottom instead of from left to right, without violating the principles of the invention.
Other examples are the reversal of directions, such as scanning from right to left, or having high
frequencies at low pixel positions etc. However, for the sake of readability particular choices for orientations
and directions were made in the description, in accordance with Figure 1, unless stated otherwise. Similarly,
the particular number of rows, columns or brightness values in the image is not at all fundamental to the
transformation. The example of Figure 1 indicates for the sake of simplicity eight rows and eight columns
with three brightness values, or grey-tones, per pixel. A more realistic example of a transformation system
for visual images, described further on, has 64 rows and 64 columns, with sixteen brightness values per pixel.

Figure 1 shows a chess-board-like visual image 9 partitioned into eight columns 1 through 8 and eight
rows 1 through 8, giving 64 pixels 10. For simplicity the brightness of the image pixels can have one of
three grey-tones: white, grey or black. This image can be considered to be scanned in successive vertical
scanlines coinciding with any of the columns 1 through 8. An image processing unit 11, containing a digital
representation of such an image, converts the vertical scanlines one after another into sound, in accordance
with a particular scanning sequence, here from left to right. For any given vertical scanline, the position of a
pixel in the column uniquely determines the frequency of an oscillator signal, while the brightness of this
pixel uniquely determines the amplitude of this oscillator signal. The higher the position of a pixel in the
column, the higher the frequency, and the brighter the pixel, the larger the amplitude. Signals 12 of all
oscillators in the column of a particular scanline are summed and converted with the aid of a
converting-summing unit 14 into acoustical signals 16. After this total signal has sounded for some time, the
scanline moves to the next column, and the same conversion takes place. After all columns of the image
have thus been converted into sound, a new and up-to-date image is stored, and the conversion starts
anew. At this point in time the scanline jumps from the last, here rightmost, column to the first, here leftmost,
column of the image.
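
The mapping of Figure 1 can be sketched in a few lines of Python. This is only an illustration of the principle, not the hardware design described later: the function names, the sample rate, the time per column and the equidistant frequencies are all assumptions chosen for the example, and each scanline simply restarts its oscillators at phase zero.

```python
import math

def scanline_samples(column, freqs, sample_rate, duration):
    """Sum one sine oscillator per pixel of a single vertical scanline.

    column[i] is the brightness (amplitude) of the pixel at height i,
    freqs[i] is the frequency in Hz assigned to that height.
    """
    n = int(sample_rate * duration)
    return [
        sum(a * math.sin(2.0 * math.pi * f * t / sample_rate)
            for a, f in zip(column, freqs))
        for t in range(n)
    ]

def image_to_sound(image, freqs, sample_rate=8000, column_time=0.05):
    """Convert image[row][col] (row 0 = top) into a list of sound samples,
    scanning the columns from left to right."""
    rows, cols = len(image), len(image[0])
    samples = []
    for c in range(cols):
        # The bottom row gets freqs[0], so higher pixels get higher frequencies.
        column = [image[rows - 1 - i][c] for i in range(rows)]
        samples.extend(scanline_samples(column, freqs, sample_rate, column_time))
    return samples

# 8 x 8 image with three grey-tones (0 = black, 1 = grey, 2 = white):
# a white diagonal from the bottom left to the top right on a black background.
image = [[2 if r + c == 7 else 0 for c in range(8)] for r in range(8)]
freqs = [500.0 + 250.0 * i for i in range(8)]  # hypothetical, equidistant
sound = image_to_sound(image, freqs)
print(len(sound))  # 8 columns of 400 samples each
```

With this diagonal test image only one oscillator is active per column, and its frequency steps upward from column to column, which is the rising tone described for a line running from the bottom left to the top right corner.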

Due to the simplicity of the transformation, no training will be needed to interpret simple pictures. For
example, one may consider a straight white line on a black background, running from the bottom left corner
to the top right corner of the image. This will obviously result in a tone that starts at a low pitch and
increases steadily in pitch, until the high pitch of the top right pixel is reached. This sound is repeated
over and over as a new frame (here with the same contents) is grabbed every few seconds. A white
rectangle standing on its edge on a black background will sound as bandwidth-limited noise with a duration
corresponding to the width of the rectangle, etc. After understanding the simple transformation of image into
sound, it becomes easy even to imagine the sound of a picture before actually hearing it. The interpretation
of more complicated pictures will require more training, as learning a new language does. But right from the
beginning there is the bonus of understanding simple pictures, which may enhance the motivation of a user
of the transformation system to proceed in practising.

An example of a system that performs image-to-sound conversion is depicted in Figure 2. Images 20
are transformed into electronic signals by an image sensing unit 22 provided with a unit 24 for conversion
of images into electronic signals, such as a camera. These electronic signals, which include synchronization
signals, are processed by an image processing unit 26, in which a number of transformations take place.
The image processing unit takes care of the conversion of analog electronic image signals into digital
signals, which are stored in digital memory 28, after which digital data processing and waveform synthesis
in a data processing and waveform synthesis unit 30 yields a digitized waveform, which is finally converted
into analog electronic signals in a DA-conversion and analog output stage 32 for a sound generating output
unit 34. The sound generating output unit could be headphones or any other system 36 for converting
electronic signals into sound, which does not exclude the possibility of using intermediate storage, such as
a tape recorder. The image processing unit and, if they require a power supply, also the image sensing unit
and/or the sound generation unit are powered by a power supply 38 as indicated in Figure 2. Dashed lines
indicate this condition, since for example headphones normally do not require an additional power supply.
The power supply may be configured for battery operation.

In the following, architectural considerations for an image processing unit, and in particular the digital
data processing and waveform synthesis unit within the image processing unit, as indicated in Figure 2, are
described. Transformation of image into sound can be seen as a flow of data undergoing a relatively
complicated transformation. This complicated transformation can be decomposed into different processing
stages, which reduces the complexity of the transformations taking place in individual stages. This will in
general also allow for a reduction in processing time per stage. Data leaving a stage can be replaced by
data leaving the previous stage. The processing of a number of data items takes place in parallel, although in
different stages of the total transformation. Thus a much larger data flow can be achieved by organizing the
architecture as a pipeline, i.e. a sequence, of simple stages. In the design for the invention such a pipelined
architecture has been applied. Without this kind of parallelism, a fast mainframe would have been needed to
get the same real-time response. Obviously, the system would then not be portable, low-cost or
low-power. The image processing unit can be seen as a special-purpose computer with mainframe capabilities
on its restricted task, enabling real-time image-to-sound conversion using a clock frequency of only 2 MHz.
This in turn enables the use of standard normal-speed components like EPROMs (Erasable Programmable
Read-Only Memories) with 250 ns access times and static RAMs (static Random-Access Memories) with
150 ns access times, thereby reducing the cost of the system. Only the two phases of a single system
clock are used for synchronization. The system emulates the behaviour of a superposition of 64
amplitude-controlled independent oscillators in the acoustic frequency range. There are reasons for taking this
number, although other numbers can be used without violating the principles of the invention. Yet, for
simplicity of the description the number 64 and numbers derived from it are used on many occasions
without further notice. The 64 oscillators do not physically exist as 64 precisely tuned oscillator circuits,
because this would probably give an unacceptably high system cost and size. Instead the system employs
its high digital speed to calculate in real time what the superposition of 64 independent oscillators would
look like. To do this, it calculates a 16-bit sample of this superposition every 32 microseconds, thereby
giving a 16-bit sample frequency of 31.25 kHz for the resulting acoustic signal (which, for comparison, is
close to the 44.1 kHz 16-bit sample frequency of CD players). The 64 oscillators that have to be handled in
the 32 microseconds allow for 500 nanoseconds per oscillator sample. A sufficiently parallel system will
therefore be able to do this job with a system clock frequency of 2 MHz.
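
The timing figures quoted above follow directly from the clock frequency and the number of oscillators. The short check below merely restates that arithmetic; the variable names are chosen for the example.

```python
CLOCK_HZ = 2_000_000   # system clock frequency from the text
N_OSCILLATORS = 64     # emulated oscillators per sound sample

# One system clock cycle is spent on each oscillator.
time_per_oscillator_ns = 1e9 / CLOCK_HZ
sample_period_us = N_OSCILLATORS * time_per_oscillator_ns / 1000
sample_rate_hz = CLOCK_HZ / N_OSCILLATORS

print(time_per_oscillator_ns)  # 500.0
print(sample_period_us)        # 32.0
print(sample_rate_hz)          # 31250.0
```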

Figure 3 illustrates the algorithmic structure of the waveform synthesis, by indicating schematically how
a single waveform sample is calculated within the image processing unit 40. This single waveform sample
corresponds directly with a single sound sample. The algorithmic structure of Figure 3 can also be viewed
as a block diagram of the waveform synthesis stage within the image processing unit of Figure 2. For a
particular scanline, as in the example of Figure 1, all pixels i on that scanline are processed. For a pixel i, a
new phase $\varphi_i + \Delta\varphi_i$ is calculated by adding a phase increment to a previous phase
retrieved from memory. The result becomes the argument of a sine function in a scaled sine module 42, the
resulting sine value being multiplied by a pixel brightness value $A_i$. In the image processing unit, this
multiplication need not physically take place, but the result of one of all the possible multiplications may
be retrieved from a memory module 42. The scaled sine value resulting from this (implicit) multiplication is
added to the results from previous pixels. When the results for all pixels i, 64 in the detailed design, have
thus been accumulated, the overall result is a single sample of the emulated superposition of 64 oscillators.
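
The calculation of a single sample can be sketched as a phase accumulator combined with a scaled-sine lookup table. The sketch below follows the structure of Figure 3 as described in the text, but the phase resolution `PHASE_STEPS`, the table size `TABLE_SIZE` and the function name `next_sample` are assumptions made for illustration, and floating-point values stand in for the fixed-point contents of the actual memories.

```python
import math

PHASE_STEPS = 1 << 16  # assumed phase resolution (not taken from the text)
TABLE_SIZE = 256       # assumed number of phase entries per brightness level
LEVELS = 16            # 4-bit brightness values

# Scaled-sine table: one row per brightness value, so the multiplication
# by brightness is replaced by a table lookup.
SCALED_SINE = [
    [a * math.sin(2.0 * math.pi * p / TABLE_SIZE) for p in range(TABLE_SIZE)]
    for a in range(LEVELS)
]

def next_sample(phases, increments, brightness):
    """Return one sample of the superposition; updates `phases` in place."""
    total = 0.0
    for i in range(len(phases)):
        # New phase = previous phase + phase increment (modulo a full turn).
        phases[i] = (phases[i] + increments[i]) % PHASE_STEPS
        index = phases[i] * TABLE_SIZE // PHASE_STEPS
        # Implicit multiplication: look up the sine already scaled by brightness.
        total += SCALED_SINE[brightness[i]][index]
    return total

# Two hypothetical oscillators at the 31.25 kHz sample rate of the text.
sample_rate = 31250.0
increments = [round(f / sample_rate * PHASE_STEPS) for f in (500.0, 1000.0)]
phases = [0, 0]
samples = [next_sample(phases, increments, [15, 4]) for _ in range(5)]
```

Calling `next_sample` once per 32 microsecond time step yields the sample stream; the per-oscillator frequency is set entirely by the phase increment.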

In the waveform synthesis unit indicated schematically in Figure 3, several pixels are processed
simultaneously, although in different stages of the pipelined processing. A more detailed description of this
parallelism is given in the following. In the system the following operations take place in parallel, once the
image-to-sound conversion has started:

- At clock phase L(ow), duration 250 ns, an address is calculated for phase-change memory containing the
phase changes per time step of each of the 64 oscillators. This time step is the step corresponding to the
final sample frequency of 1/64 times the system clock frequency (here 2 MHz), i.e. 32 µs. The above
address corresponds to the phase change of one particular oscillator. The new phase of another oscillator
(calculated in earlier cycles) is stored, and at the same time used as an address for sine memory containing
the sine as a function of phase, scaled by an amplitude measure. The 4-bit amplitude measure is
simultaneously provided by video memory containing the frame as it was grabbed by a 4-bit
analog-to-digital converter (ADC). The sum of scaled sine samples that came from previously handled
oscillators is stored in a register. This is needed to obtain one final sample of the superposition of 64
amplitude-controlled digital oscillators.

- At clock phase H(igh), duration 250 ns, an address is obtained for phase memory containing the present
phases of all oscillators. The phase of an oscillator is read from this memory and added to the phase
change, obtained from phase-change memory for which the address was calculated at clock phase L. A
new address for the video memory is calculated. This address corresponds to the next pixel to be
processed. A scaled sine value is read from sine memory. The scaled sine value of an oscillator is
added to the sum of scaled sine values of previously handled oscillators. This is part of the process of
calculating the superposition of 64 oscillators.

After 64 such system clock cycles, i.e. 32 µs, a sample of the superposition of 64 amplitude-controlled
oscillators is ready. This value can be sent to a 16-bit digital-to-analog converter (DAC) 50 and an
analog output stage 52, and from there to a sound generating output unit such as a headphone 54
(Figure [...]).

In the above description, the clock phases H and L could have been interchanged, provided this is done
consistently throughout the text. The particular choice made is not fundamental, but a choice must be
made.

As described above, both the sine amplitude scaling and the frequencies of the oscillators are
determined by the contents of memory. The main reason for using memory instead of dedicated hardware
is that this provides the ability to experiment with arbitrary mappings. [...]
multiplication by pixel brightness may give better contrast as perceived by the user. Non-equidistant
oscillator frequencies will certainly give better frequency resolution as perceived by the user, since the
human hearing system is more sensitive to differences in pitch at the lower end of the spectrum. In a
preferred embodiment a "wohl-temperierte" (well-tempered) set of frequencies is used, meaning that the
next higher frequency is a constant factor times the previous frequency. The resulting non-equidistant
frequencies appear subjectively as almost equidistant tones. Another important reason for avoiding dedicated
hardware is the cost. Memory is cheap, provided the transformation to be performed is simple enough to fit
in a few [...]
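
A set of frequencies in which each frequency is a constant factor times the previous one is a geometric sequence. The sketch below generates such a set; the function name and the band limits of 500 Hz and 5000 Hz are assumptions for the example, since the text does not state the actual values stored in the phase-change memory.

```python
def geometric_frequencies(f_low, f_high, n):
    """Return n frequencies from f_low to f_high with a constant ratio
    between neighbouring frequencies."""
    ratio = (f_high / f_low) ** (1.0 / (n - 1))
    return [f_low * ratio ** i for i in range(n)]

# Hypothetical band limits; one frequency per pixel of a 64-pixel scanline.
freqs = geometric_frequencies(500.0, 5000.0, 64)
print(freqs[1] / freqs[0])  # the constant factor between neighbours
```

Because the mapping lives in memory, switching between an equidistant and a geometric set would only require different table contents, which is the flexibility the text argues for.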

In a preferred embodiment a video camera is used as the input source. Whenever the previous
image-to-sound conversion has finished, a video frame is grabbed and stored in electronic memory in a single
frame time of the camera, e.g. one 50th of a second in most European video cameras, and one 60th of a
second in most American video cameras. This is to avoid blurred images [...] moving camera. In the
detailed design descriptions the use of the one 50th second [...] television standard is assumed for the
choice of time constants used to grab the video signal, but this is not fundamental to the invention. The
image is stored as 64 x 64 pixels with one of 16 possible brightness values, i.e. grey-tones, for each pixel.
Other numbers of pixels or brightness values could have been used without violating the principles of the
invention, but there are practical reasons for this particular choice.

Then the next image-to-sound conversion starts. The electronic image can be considered as consisting of
64 vertical lines, each having 64 pixels. First the leftmost vertical line is read from memory. Each pixel in
this line is used to excite an associated digital harmonic oscillator. A pixel positioned higher in the vertical
line corresponds to an oscillator of higher frequency (all in the acoustic range). The greater the brightness
of a pixel, the larger the amplitude of its oscillator signal. Next all 64 oscillator signals are summed
(superposed) to obtain a total signal with 64 Fourier components. This signal is sent to a DA-converter and
output through headphones. After about 16 milliseconds the second (leftmost but one) vertical line is read
from memory and treated the same way. This process continues until all 64 vertical lines have been
converted into sound, which takes approximately one second. Then a new video frame is grabbed and the
conversion starts anew.

Henceforth the image is assumed to be scanned from the left to the right, without excluding other
possibilities for scanning directions. The horizontal position of a pixel is then represented by the moment at
which it excites its emulated oscillator. The vertical position of a pixel is represented by the pitch of its
associated oscillator signal. The intensity, or brightness, of a pixel is represented by the intensity, or
amplitude, of the sound corresponding to its associated oscillator signal.

The human hearing system is quite capable of decomposing a complicated signal into its Fourier
components (music, speech), which is precisely what is needed to interpret the vertical positions of pixels. The
following sections will discuss the limits of Fourier decomposition in the bandwidth-limited human hearing
system, which will show that more than 64 independent oscillators would not be useful.

A picture is represented by the system as a time-varying vector of oscillator signals. The vector
elements are the individual oscillator signals, each representing a pixel height and brightness. The ear
receives the entire vector as a superposition of oscillator signals. The human hearing system must be able
to decompose the vector into its elements. The hearing system is known to be capable of performing
Fourier decomposition to some extent. Therefore the oscillator signals will be made to correspond to Fourier
components, i.e. scaled and shifted sines. The hearing system will then be able to reconstruct the
frequencies of individual harmonic oscillators, and their approximate amplitudes (i.e. reconstruct pixel heights
and brightness).

Another related criterion for choosing a particular waveform is that the image-to-sound conversion should
preferably be bijective to preserve information. A bijective mapping has an inverse, which means that no
information is lost under such a mapping. To preserve a reversible one-to-one relation between pixels and
oscillator signals, the oscillator signals corresponding to different pixels must obviously be distinguishable
and separable, because the pixels are distinguishable and separable. The waveforms (functions) involved
should therefore be orthogonal. A superposition of such functions will then give a uniquely determined
vector in a countably infinite dimensional Hilbert space. A complete (but not normalized) orthogonal set of
basis vectors in this Hilbert space is given by the functions 1, cos nt, sin nt, with n positive natural numbers.
Of course only a small finite subset of these functions will be used due to bandwidth limitations.
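
For reference, the orthogonality of the set 1, cos nt, sin nt can be written out explicitly. The relations below are the standard ones over a single period, taken here as the interval from 0 to 2π (the choice of interval is an assumption of this note), for positive integers m and n:

```latex
\begin{aligned}
\int_{0}^{2\pi} \sin(mt)\,\sin(nt)\,dt &= \pi\,\delta_{mn}, &
\int_{0}^{2\pi} \cos(mt)\,\cos(nt)\,dt &= \pi\,\delta_{mn}, \\
\int_{0}^{2\pi} \sin(mt)\,\cos(nt)\,dt &= 0, &
\int_{0}^{2\pi} 1\cdot\cos(nt)\,dt &= \int_{0}^{2\pi} 1\cdot\sin(nt)\,dt = 0, \\
\int_{0}^{2\pi} 1\cdot 1\,dt &= 2\pi .
\end{aligned}
```

The norms π and 2π differ from 1, which is why the set is orthogonal and complete but not normalized.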

Other reasons for using harmonic oscillators are the following. Harmonic oscillation is the mechanical
response of many physical systems after modest excitation. This stems from the fact that the driving force
towards the equilibrium position of most solid objects is, in first order approximation, linearly dependent on
the distance from the equilibrium position (Hooke's law). The resulting second order differential equation for
position as a function of time has the sine as its solution (in the text amplitude and phase will implicitly be
disregarded when not relevant, and just the term "sine" will be used for short), provided the damping can
be neglected. The sine is therefore a basic function for natural sound, and it may be expected that the human
ear is well adapted to it. Furthermore, the construction of the human ear also suggests that this is the case,
since the basilar membrane has a difference in elasticity of a factor 100 when going from base to apex. The
transversal and longitudinal tension in the membrane is very small and will therefore not contribute to a
further spectral decomposition. However, it should be noted that the brain may also contribute considerably
to this analysis: with electro-physiological methods it has been found that the discharges in the hearing
nerves follow the frequency of a pure sine sound up to 4 or 5 kHz. Because the firing rate of individual
nerve cells is only about 300 Hz, there has to be a parallel system of nerves for processing these higher
frequencies.

Periodic signals will give rise to a discrete spectrum in the frequency domain. In practice, periodic
signals of infinite duration cannot be used, because the signal should reflect differences between
successive scanlines and changes in the environment, thus breaking the periodicity. The Fourier transform of a
single sine-piece of finite duration does not give a single spectral component, but a continuous spectrum with
its maximum centred on its basic frequency. The requirement is that the spectrum of one sine-piece
corresponding to a particular pixel is clearly distinguishable from the spectrum of a sine-piece
corresponding to a neighbouring pixel in the vertical direction.

The Fourier transform is given by:

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt$$

The Fourier transform of an arbitrary segment of a sine will now be calculated, by transforming the
product of a sine of infinite duration and a pulse of finite duration. This situation corresponds to a
bright pixel at a particular height in a (vertical) scanline, surrounded by dark pixels at the same height in
neighbouring scanlines.

The Fourier transform of a pulse with amplitude 1, starting at a and ending at b, is calculated as:

$$F(\omega) = \int_{a}^{b} e^{-j\omega t}\, dt = \frac{2}{\omega}\, \sin\!\left(\frac{\omega (b-a)}{2}\right) e^{-j\omega (a+b)/2}$$

Next the modulation theorem is applied. Let $f(t) \leftrightarrow F(\omega)$ be a Fourier pair; then

$$s(t) = f(t)\, \sin(\omega_0 t) \;\leftrightarrow\; \tfrac{j}{2}\left[ F(\omega + \omega_0) - F(\omega - \omega_0) \right] = S(\omega)$$

Substitution of the above expression for $F(\omega)$ and some rewriting finally gives, with
$b - a = 2\pi k/\omega_0$:

$|S(\omega)|^2$ = (see handwritten calculations).

Analysis of this formula shows that for $\omega \gg 1$, $k \gg 1$ and
$|\omega_0 - \omega| \ll |\omega_0 + \omega|$ the highest peaks occur in the term:

$$G(\omega) \approx \left(\frac{k\pi}{\omega_0}\right)^{2} \operatorname{sinc}^{2}\!\left(\frac{k(\omega - \omega_0)\pi}{\omega_0}\right)$$
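
Since the intermediate result is only referred to as a handwritten calculation, the following is a reconstruction of how the approximation can be obtained, not the original derivation. It assumes the convention sinc(y) = sin(y)/y, the standard modulation theorem for multiplication by a sine, and a pulse containing k periods, so that b − a = 2πk/ω₀:

```latex
\begin{aligned}
|F(\omega - \omega_0)|
  &= \left|\frac{2}{\omega - \omega_0}\,
     \sin\!\left(\frac{k(\omega - \omega_0)\pi}{\omega_0}\right)\right|
   = \frac{2k\pi}{\omega_0}\,
     \left|\operatorname{sinc}\!\left(\frac{k(\omega - \omega_0)\pi}{\omega_0}\right)\right|, \\
|F(\omega + \omega_0)|
  &\le \frac{2}{|\omega + \omega_0|}
   \;\ll\; \frac{2k\pi}{\omega_0}
   \qquad (k \gg 1,\ \omega \approx \omega_0), \\
|S(\omega)|^2
  &\approx \tfrac{1}{4}\,|F(\omega - \omega_0)|^2
   = \left(\frac{k\pi}{\omega_0}\right)^{2}
     \operatorname{sinc}^{2}\!\left(\frac{k(\omega - \omega_0)\pi}{\omega_0}\right).
\end{aligned}
```

Near ω = ω₀ the term with ω + ω₀ is negligible compared with the main peak, which leaves the squared sinc shape centred on the basic frequency.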
This simple formula will be used as an approximation of $|S(\omega)|^2$. It should be kept in mind that this
approximation is only accurate near the maxima of $|S(\omega)|^2$, but that is sufficient for the purposes of
[...]. The behaviour of $\sqrt{G(\omega)}$, scaled to its maximum, is given by the sinc function
$|\operatorname{sinc}(k(\omega - \omega_0)\pi/\omega_0)|$ (with [...] quite well by $|1/(\pi x)|$. The next
largest maxima are found at [...] 0.217, i.e. a fifth of the main maximum. It seems reasonable to take as a rule
of thumb that a neighbouring [...] should give an |x|-value no less than [...] spectrum with basic [...] the
frequency of one of the next-to-main maxima of the original [...]

Let the useful bandwidth be given by B. An equidistant [...] would then give $\Delta f_0 = B/N$ (N pixels in a
vertical line of the image). Combination of the above formulas [...] $= x_1$. Taking $x_1 = 1.430$, an estimated
bandwidth of 5 kHz and a conversion time of 1 second yields $N \le [...] = 59$. Therefore the digital design is
preferably made with N = 64 (a [...] can do a lot of image processing to extract useful information from noisy
and blurred [...] be too pessimistic.
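
The numbers that survive in this passage (0.217, 1.430 and 59) can be checked numerically. The sketch below is a reconstruction under stated assumptions, not the original calculation: it assumes that x is the frequency offset measured in units of the reciprocal scanline duration, that each of the N scanlines lasts T/N seconds, and that neighbouring frequencies are B/N apart, so that the spacing condition reads T·B/N² ≥ x₁. The symbol T and the function names are introduced here for illustration.

```python
import math

def sinc_abs(x):
    """|sin(pi x) / (pi x)|, with the limit value 1 at x = 0."""
    if x == 0.0:
        return 1.0
    return abs(math.sin(math.pi * x) / (math.pi * x))

def first_sidelobe(steps=10000):
    """Scan 1 < x < 2 for the largest |sinc| value (the first side maximum)."""
    best_x, best_v = 1.0, 0.0
    for i in range(1, steps):
        x = 1.0 + i / steps
        v = sinc_abs(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

x1, height = first_sidelobe()

# Assumed reading of the garbled formula: T * B / N**2 >= x1.
T = 1.0      # conversion time in seconds (from the text)
B = 5000.0   # estimated bandwidth in Hz (from the text)
n_max = math.sqrt(T * B / 1.430)

print(round(x1, 4), round(height, 4), int(n_max))
```

Under these assumptions the first side maximum lies near x = 1.430 at about a fifth of the main maximum, and the bound on N comes out just above 59, in line with the figures quoted in the text.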
IMAGE PROCESSING UNIT

[...] the number of bits [...]. Because in [...] there is no clear-cut separation between the concepts of data
buses and address buses, no [...] data generated by one part of the system are used as addresses by
another part. Approximate coordinates in Figure 4 are indicated by two-character X,Y pairs. Apart from X,Y
coordinates, mnemonic names are also given for the components as used in [...]
of the detailed circuit topology, which describes real components associated with the mnemonic names and
[...]
Sunng frame grabbing, the multiplexer MUX bypasses the 7-bit middle counter to give a 12S-foW increase 
fn hf f^endes of the most significant counter bits (the bottom counter). This « needed to grab a video 
£ame Si TSTx ms (50 ^ television signal singie frame time and thus avoid blurred ™0~ ™> ■ « 
Isb's e mit significant bits, of the bottom counter aiways indicate a particular vertcal^ scarce 
Szon^"poSon) The middle counter just ensures that it takes some time (and sound samples) before 
veX scanline is going to be converted into sound. The s MHz input to the top counter causes 
S Sh-Oi every 500 ns. During image-to-sound ^^^SSS 

K^J^nnter chanaes everv 16.4 ms. so the conversion of the whole image. i.e. 64 vertical scaniines, 

bits of the middle counter, which is the purpose of the sw.teh S« at Yi. In that case tne coumers 
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configured as one large 24-Mt counter. However, in the discussion a 7-bit middlecounter * assumed ifljQS 
f conversion time), giving a 23-bit total counter (bit 0 through 22). unless stated otherwise. Addresses 
generated by the counters go unlatched to phase change EPROMs DFI1,2 (at Rl), while they are latched by
L1CNT, L2CNT before going to phase SRAMs FI1,2 (at Ni). Thus care has been taken of the fact that the
EPROMs are much slower (250 ns) than the SRAMs (150 ns). Therefore the EPROMs receive their
addresses 250 ns earlier. A phase change of a particular oscillator read from the EPROMs is added to a
present phase read from the SRAMs. Summation takes place in 4-bit full adders AD1-4 (at Bl-Ki), and the
result is latched by octal latches L1FI, L2FI (at Cj, Ij) before being rewritten into the SRAMs. The new phase
is also sent through latches L3FI, L4FI (at Dg), together with 4-bit pixel brightness information coming from
video SRAMs PIX1-4 (at Xb-Xe). After a possible negation (ones complement) by exclusive-ORs XOR1-3 (at
Bf-Gf) the phase and brightness are used as an address for the sine EPROMs SIN1,2 (at Kf). These give a
sine value belonging to a phase range 0...PI/2 (1st quadrant), and scaled by the brightness value. The whole
2PI (four quadrant) phase range is covered by complementing the phase using exclusive-ORs and by
bypassing the sine EPROMs with an extra sign bit through a line passing through a D-flipflop DFF2 at
Ae; this flipflop gives a delay ensuring that the sign bit keeps pace with the rest of the sine bits. This sign
bit determines whether the ALUs ALU1-5 (at Mc-Ac) add or subtract. The ALUs combine the results of all
64 emulated oscillators in one superposition sample. The latches L1SIN, L2SIN, L3SIN at Nd, Id, Cd are just
for synchronization of the adding process. When the superposition has been obtained after 64 system clock
cycles, the result is sent through latches L1DAC, L2DAC (at Oa) to a 16-bit digital-to-analog converter DAC
(at Oa). The inverter at the bottom of Figure 4 serves to give an offset to the summation process by the
ALUs after clearing the latches at Cd, Id, Nd. The DAC (hexadecimal) input range is 0000H to FFFFH, so
the starting value for the addition and subtraction process should be halfway at 8000H to stay within this
range after adding and subtracting 64 scaled sine samples. The design as indicated keeps the super-
position almost always within this range without modulo effects (which would occur beyond 0000H and
FFFFH), even for bright images. This is of importance, because overflows cause a distracting clicking or
cracking noise. The average amplitude of the superposition will grow roughly with the square root of the
number of independent oscillators times the average amplitudes of these oscillators. This can be seen
from statistical considerations when applying the central limit theorem to the oscillator signals and treating
them as stochastic variables, and simplifying to the worst case situation that all oscillators are at their
amplitude value (+ or -). Therefore the average amplitude of such a 64 oscillator superposition will be
about 8 times the amplitude of an individual oscillator, also assuming equal amplitudes of all oscillators as
for a maximum brightness image. This factor 8 gives a 3-bit shift, which means that there must be
provisions for handling at least 3 more bits. This is the purpose of ALU5 (at Ac), which provides 4 extra bits
together with part of L3SIN (at Cd) (*). The output of the DAC (at Qa) is sent through an analog output
stage, indicated only symbolically by an operational amplifier (at Sa). Finally the result reaches headphones
(at Ta).

(*) Numerical calculations on sine superpositions showed that for a very bright image field, 3 bits would
cause overflow during 16% of the time, whereas 4 bits would cause overflow during 0.5% of the time.
Experimentally, this appears to be disturbing. Overflows for large and very bright image parts are heard as
a disturbing 'cracking' sound. Division of all sine values in the EPROMs by a factor 4 cures this overflow
problem with no noticeable loss of sound quality (a 16-bit sine value would have been rather redundant
anyway).
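
The oscillator path described above (16-bit phase accumulation, truncation to 12 bits, first-quadrant sine lookup with ones complement and a separate sign bit, and summation starting from the mid-scale offset 8000H) can be sketched in software. The sketch below is only an illustration under stated assumptions, not the circuit itself: the table amplitude `PEAK`, the half-step offset in the table, and the function names `sine_sample` and `audio_sample` are hypothetical choices that do not come from the description.

```python
import math

PHASE_BITS = 16     # width of the phase accumulator (phase SRAMs)
USED_BITS = 12      # most significant phase bits used for the sine lookup
QUARTER_BITS = 10   # address bits covering the first quadrant 0..PI/2
PEAK = 511          # hypothetical full-brightness table amplitude (assumption)
DAC_MID = 0x8000    # starting offset, halfway the 16-bit DAC input range

# First-quadrant table indexed by (brightness, 10-bit phase), standing in for
# the sine EPROMs. The half-step offset (an assumption) makes the ones
# complement of the address an exact mirror image.
QUARTER = [[round(PEAK * b / 15 * math.sin((i + 0.5) * math.pi / 2 / 1024))
            for i in range(1 << QUARTER_BITS)]
           for b in range(16)]


def sine_sample(phase16, brightness):
    """Brightness-scaled sine value for a 16-bit phase and 4-bit brightness."""
    p = phase16 >> (PHASE_BITS - USED_BITS)   # truncate to 12 bits
    quadrant = p >> QUARTER_BITS              # two top bits select the quadrant
    addr = p & ((1 << QUARTER_BITS) - 1)
    if quadrant & 1:                          # 2nd and 4th quadrant: mirror
        addr ^= (1 << QUARTER_BITS) - 1       # ones complement (exclusive-ORs)
    value = QUARTER[brightness][addr]
    return -value if quadrant & 2 else value  # sign bit bypasses the table


def audio_sample(phases, phase_changes, column):
    """Advance all 64 oscillators one step and return one 16-bit DAC word."""
    total = DAC_MID
    for k in range(64):
        phases[k] = (phases[k] + phase_changes[k]) & 0xFFFF
        total += sine_sample(phases[k], column[k])
    return total & 0xFFFF                     # wraps only on overflow
```

With the assumed `PEAK` of 511 the worst-case sum of 64 oscillators is 32704, so this sketch never wraps around; the hardware instead relies on the statistical argument given in the text and accepts rare overflows.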

In the above a particular oscillator sample has more or less been followed through the system. When
the image-to-sound conversion is complete, the frame grabbing process starts, triggered by bit 21 of
counter CNT3 (at Zg).

An analog video signal from an image sensing unit, such as a camera, indicated symbolically at 713, is
sent through a sample-and-hold circuit SAM, indicated in Figure 4 as part of the ADC block at Uc, and
converted to a 4-bit digital signal in Gray code by an AD-converter ADC1-15 (at Uc). A more detailed
illustration of a sample-and-hold circuit is given in Figure 8. The Gray code serves to reduce the probability
of getting very inaccurate results due to transition states (spikes and glitches): in subsequent Gray-coded
numbers only one bit at a time changes. The 4-bit code is then stored in video SRAM, i.e. PIX1-4 (at Xb-Xe), which
receives its addresses from counters CNT1-3 (at Zg-Zk). Two bits of this address are used for chip
selection by a demultiplexer DMX (at Zc). Components around Rd form the control logic (or "random"
logic) that takes care of detailed timing, synchronization and mode switching (frame grabbing versus image-
to-sound conversion). This logic is depicted in more detail in Figure 5. The meaning of the symbols in the
control logic is the following. The tau's represent triggerable delays as monoflops MON. Small sigma's
represent the horizontal and vertical synchronization pulses of the video signal, the outputs of the
comparators ADHSNC and ADVSNC. The delta's represent differentiating subcircuits that generate a
three-gate-delay spike on the trailing edge of an input pulse. This spike is long enough to trigger subsequent

circuits. In a prototype implementation according to the description in the invention the [...]
unit, i.e. the whole circuit for frame grabbing and image-to-sound conversion, has been implemented on a
single circuit board measuring only 236x160 mm. This proves that the construction of a modest-sized
portable system is feasible. The only devices not on the board are the camera and the power supply. The
[...] is the camera of the [...]. The
power consumption of the system appeared to be quite low: a measured 4.4 W [...],
excluding the camera. This is obtained while using mainly LS TTL components. So the system is
practical for battery operation. The power consumption could be [...]
CMOS ICs. LS TTL is attractive because of its combination of low cost and high electronic stability. The
[...] TMK3412 has a specified power consumption of 2.7 W. This too could be much
reduced by using a CCD camera instead, having a typical 100 mW power consumption. Still, even the total
power consumption of 7.1 W [...] monocells (each
1.2 V, 4 Ah) would last more than six hours, which should be more than enough for a full day.
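
The Gray coding of the 4-bit brightness value mentioned above for the AD-converter can be illustrated with a short sketch. It uses the standard reflected binary Gray code; whether the comparator network of the prototype produces exactly this code is an assumption, and the function names are hypothetical.

```python
def binary_to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# The 16 brightness levels of a 4-bit pixel as 4-bit Gray code words.
GRAY_LEVELS = [format(binary_to_gray(level), "04b") for level in range(16)]
```

Neighbouring brightness levels differ in exactly one bit, so a sample taken while the comparators are still settling can be off by at most one level instead of by an arbitrary amount.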

To obtain 64 different oscillator frequencies, at the very least 64 different values are needed in the
[...], i.e. 6 bits. But to be able to vary the differences in frequency for successive
[...], for example, if those differences are to increase linearly with [...]
number, 63 different phase change differences have to be represented.

Then 6 bits would be needed to indicate a basic oscillator phase change, and another 6 bits to represent
the difference in phase change from the other oscillators' phase changes. So 12 bits phase change
accuracy is the minimum for a set of frequencies in which the differences between successive frequencies
[...] number. A good choice is the use of 16 bits accuracy in order to obtain extra
freedom for non-linearly increasing frequency differences.

As illustrated in Figure 4, not all 16 phase bits are used in calculating the sine values. The phase is
truncated to its most significant 12 bits to free 4 bit lines that carry amplitude information from video
[...] to the sine memory for scaling. The truncation causes an error in the phases. The size of the error
depends on the [...] phase bits. Because there is no intended special relation
between the 31.25 kHz system sample frequency and an emulated oscillator signal, the truncation errors can
be considered as noise in successive samples of a particular oscillator. This is phase noise, resulting in
frequency noise of the oscillator. The frequency noise must be much less than the frequency difference
between neighbouring oscillators to prevent loss of resolution. The average frequency is still specified by
the [...], but the actual signal jumps a little around this frequency because of the
truncation. The frequency noise can be calculated. Let the system clock have frequency F. [...]
system clock supports 64 oscillators sequentially, F/64 oscillator samples per second [...]
a sine value from the whole 2*PI radians period (10 bits specify PI/2 radians in the sine EPROMs).
[...] more bits (truncation) is therefore less [...]
bit, corresponding to a 1/2^12 period phase step error when going from [...]
the next. This in turn corresponds to a frequency noise of (F/64)/2^12 = 7.63 Hz for F = 2 MHz. This is
indeed much less than the frequency differences between neighbouring oscillators. For example, an
equidistant frequency distribution of 64 oscillators over a bandwidth B gives steps [...] of
B/(64-1). Let B = 5 kHz, then this gives 79.4 Hz >> 7.63 Hz. Many non-equidistant schemes also match the
[...] 16 times more accurate (4 bits). It is noted that a frequency in the
[...] can be programmed with a resolution of about 0.5 Hz, which gives more than
enough flexibility in choosing a non-equidistant frequency distribution.
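
The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes the 2 MHz system clock and a bandwidth B of 5 kHz, the value consistent with the quoted 79.4 Hz step; the helper `phase_change_word` is a hypothetical illustration of how a 16-bit phase change could be derived from a desired frequency and is not taken from the description.

```python
SYSTEM_CLOCK_HZ = 2_000_000   # system clock F
OSCILLATORS = 64              # oscillators served sequentially by the clock
PHASE_BITS = 16               # phase change accuracy
USED_BITS = 12                # phase bits kept after truncation

# Each oscillator receives one sample per 64 clock cycles: 31.25 kHz.
sample_rate_hz = SYSTEM_CLOCK_HZ / OSCILLATORS


def frequency_noise_hz(used_bits=USED_BITS):
    """Frequency noise bound from truncating the phase to `used_bits` bits."""
    return sample_rate_hz / 2 ** used_bits


def equidistant_step_hz(bandwidth_hz):
    """Frequency step when 64 oscillators are spread evenly over a bandwidth."""
    return bandwidth_hz / (OSCILLATORS - 1)


def frequency_resolution_hz():
    """Smallest programmable frequency step with a 16-bit phase change."""
    return sample_rate_hz / 2 ** PHASE_BITS


def phase_change_word(frequency_hz):
    """Hypothetical helper: 16-bit phase change per sample for a frequency."""
    return round(frequency_hz / sample_rate_hz * 2 ** PHASE_BITS)


if __name__ == "__main__":
    print(f"sample rate      : {sample_rate_hz:.0f} Hz")
    print(f"frequency noise  : {frequency_noise_hz():.2f} Hz")
    print(f"equidistant step : {equidistant_step_hz(5000):.1f} Hz")
    print(f"resolution       : {frequency_resolution_hz():.3f} Hz")
```

The resolution of about 0.48 Hz is 16 times finer than the 7.63 Hz noise bound, which corresponds to the 4 truncated bits.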



VIDEO FRAME GRABBING 

The mnemonic component names and two-character coordinates in the following text still refer to
Figure 4 and the associated appendix for circuit topology. A conventional video camera can be used on its
side (tilted 90 degrees), such that image scanning may actually take place from bottom to top for each
scanline and from left to right for successive scanlines. This is just to have the same pixel order for frame
grabbing as for image-to-sound conversion. Then just a single address generator suffices (the counter
CNT1-3), which saves a lot of components. Thus system size and cost can be reduced. In order to avoid
confusion, a frame grabbing system is described as if the camera were not tilted. This means that the
conventional names for video signals are used. It should be kept in mind that with this convention a
horizontal scanline of the video frame will represent a vertical scanline in the user image. The term user
is used when indicating the image the user perceives.

A video frame is scanned by the camera 50 times per second, assuming the European PAL convention
for convenience, as stated before, because other television standards would only slightly alter the descrip-
tions. The scanning takes place from left to right (64 us per horizontal line, i.e. 15.625 kHz), and from top to
bottom (312/313 lines). The 625-line PAL standard black & white video signal applies interleaving,
alternatingly scanning 312 vertical lines in one 20 ms frame and 313 lines in the other. Because only a
vertical resolution (horizontal to the user) of 64 pixels is needed, one arbitrary 20 ms frame is grabbed to
minimize blurred images. The system frequency is independent of the camera frequency, so synchroniza-
tion is needed for frame grabbing. The frame grabbing process is enabled when bit 21 of the counter
CNT1-3 becomes low, which happens when the previous image-to-sound conversion is completed. The
system then halts (clock disabled) until a vertical synchronization pulse in the video signal occurs
(ADVSNC). Subsequently a monoflop MON adds another 1.52 ms delay. This ensures that the top margin
of the frame is skipped (the left margin to the user, which is also invisible on the monitor). Another monoflop
MON is then enabled, and triggered by a horizontal synchronization pulse in the video signal (ADHSNC).
This adds another 15.2 us delay to skip the left margin of the frame (the bottom margin to the user, also
invisible on the monitor). Then the system clock is enabled, allowing the counter to count freely while the
first horizontal line (leftmost vertical user
line) is scanned. During one clock phase the video signal is sampled in a sample-and-hold stage and
converted into Gray code by a set of comparators. During the other clock phase, when the 4-bit code has
stabilized, the resulting digital brightness value (grey-tone) is latched and stored in video memory, while the
sample-and-hold stage is opened for capturing the next video sample (pixel). The clock is disabled after 64
clock pulses, i.e. 32 us, which covers most of the area visible on the monitor. By then the first 64 pixels
have been stored. After a horizontal synchronization pulse and a 15.2 us delay from the monoflop the clock
is enabled again for the second line etc. Because of the counter configuration CNT1-3, with CNT2
bypassed, only every fourth line of the video signal causes an [...] PIX1-4. So only the last of every four lines is
actually remembered for later use. The others are overwritten. This is meant to make uniform use of the
312/313 line frame. Effectively grabbing only every fourth line ensures that the 64 vertical pixel resolution
covers 64*4 = 256 lines of the frame, which is almost the whole visible field of the camera. After 256
scanlines, bit 21 of the counter becomes high, disabling the frame grabbing and restarting the image-to-
sound conversion.
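
The effect of storing only the last of every four video lines can be sketched as follows. This is a software illustration of the overwrite behaviour described above, not of the counter wiring; the function name `grab_frame` and the list-based frame representation are assumptions made for the example.

```python
CLOCK_PERIOD_US = 0.5    # 2 MHz system clock
PIXELS_PER_LINE = 64     # clock pulses per grabbed video line
LINES_PER_STORED = 4     # video lines sharing one stored line
STORED_LINES = 64        # lines kept in video memory

# Sampling one video line takes 64 pulses of 0.5 us, i.e. 32 us.
SAMPLING_TIME_US = PIXELS_PER_LINE * CLOCK_PERIOD_US


def grab_frame(video_lines):
    """Store 256 successive video lines into 64 memory lines.

    Four successive video lines map onto the same memory line, so each
    later line overwrites the earlier ones and only the last of every
    four lines survives.
    """
    memory = [None] * STORED_LINES
    used = video_lines[:STORED_LINES * LINES_PER_STORED]
    for n, line in enumerate(used):
        memory[n // LINES_PER_STORED] = list(line[:PIXELS_PER_LINE])
    return memory
```

For a test frame in which every pixel of video line n holds the value n, the stored lines hold 3, 7, 11, ..., 255, which shows the uniform coverage of the 256 lines used.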

In the following table a schematic overview of frame grabbing is given, indicating steps A through F.
A: bit 21 becomes low, disabling the system clock
B: a vertical synchronization pulse triggers the first monoflop
C: the delayed pulse from the first monoflop enables the second monoflop
D: a horizontal synchronization pulse triggers the second monoflop
E: the second monoflop enables the system clock
F: bit 5 of CNT1 disables the system clock after every 64 pulses
                                       A
  SCREEN                               B
  +++++++++++++++++++++++++++++++++++++:++++++++++++
  +                                    :
  +                              <Tv> (1.52 ms)
  +                                    :
  +                                    C
  +....<Th>      E {clock enabled}         F {bit 5 = L}
  + (15.2 us)..  E                         F
  +              E                         F
  +   clock      E...<64 pulses>...        F   clock
  +   disabled   E....(32 us)              F   disabled
  +              E                         F
  +              E                         F
  +
  ++++++++++++++++++++++++++++++++++++++++++++++++++
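
The sequence of steps A through F can also be read as a small state machine. The sketch below is a behavioural abstraction under assumptions: the state names, the event names and the `next_state` function are hypothetical, and the two monoflop delays are folded into the "delayed" events rather than modelled in time.

```python
from enum import Enum, auto


class State(Enum):
    CONVERTING = auto()   # image-to-sound conversion, clock running
    WAIT_VSYNC = auto()   # clock halted, waiting for vertical sync
    WAIT_HSYNC = auto()   # clock halted, waiting for horizontal sync
    SAMPLING = auto()     # clock running, 64 pixels of one line are stored


# (current state, event) -> next state; the letters refer to steps A..F.
TRANSITIONS = {
    (State.CONVERTING, "bit21_low"): State.WAIT_VSYNC,       # A
    (State.WAIT_VSYNC, "vsync_delayed"): State.WAIT_HSYNC,   # B, C (1.52 ms)
    (State.WAIT_HSYNC, "hsync_delayed"): State.SAMPLING,     # D, E (15.2 us)
    (State.SAMPLING, "bit5_low"): State.WAIT_HSYNC,          # F (64 pulses)
    (State.SAMPLING, "bit21_high"): State.CONVERTING,        # 256 lines done
    (State.WAIT_HSYNC, "bit21_high"): State.CONVERTING,
}


def next_state(state, event):
    """Return the state after an event; irrelevant events are ignored."""
    return TRANSITIONS.get((state, event), state)


def clock_enabled(state):
    """The system clock runs while converting and while sampling a line."""
    return state in (State.CONVERTING, State.SAMPLING)
```

A horizontal sync pulse that arrives while the system is still waiting for the vertical sync is ignored in this model, which mirrors the requirement that the second monoflop is only enabled by the delayed pulse of the first.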



CONTROL LOGIC 



Control logic or random logic, having the function of controlling [...]
different circuit blocks and pipeline stages, is shown in more detail in Figure 5. The labels A-F refer to
screen positions in the picture of the above schematic.
A] Bit 21 becomes low, thereby giving a low clear pulse [...] DFF1.1 after differentiation
and inversion. This forces NOR1.3 to remain [...] to stay low, as there is
only one-sided differentiation. The [...] bit 21 gets
low, so does bit 5, which causes a differentiation [...] this becomes a low
clear pulse at the clock-enable flipflop DFF2.13. The output of DFF2 will [...] become low and disable the
system clock, DIVCK.7. Now only [...] can restart
[...] from the synchronization signals at MON.5 and MON.13. The [...] cannot restart
the system, because OR.5 is high. So the [...] occurs at the time point
indicated by label B, which after some [...] by pulling DFF1.4 low
for a short time, so DFF1.6 becomes low. But then [...] as soon as the
[...] label E. MON.5 could already be low.
[...] stored in memory (PIX1-4), until bit 5 becomes low again at label F, which disables the system clock until
the next delayed horizontal sync etc. [...] bit 21 becomes

[...] DFF2.13 as long as bit 21 stays high. A constant read signal is forced upon PIX1-4 for a high bit 21 through
OR.11. This is the image-to-sound conversion phase, for which the synchronization is much less tricky than
for the frame grabbing, during which two asynchronous subsystems, i.e. the image sensing unit and image
processing unit, have to be synchronized. The major issues are now to prevent bus conflicts, shift data
through the stages of the pipeline and output an audio sample every 64 system clock cycles. The latter task
is handled at DFF1.12. CNT1-3 count after the system clock signal goes low. It takes some time before bit 5
goes low because of the ripple carry counter, so it takes one extra clock cycle before DFF1 [...] goes [...].
This delay prevents the loss of the last (64th) oscillator contribution when the latches L1SIN-L2SIN are
cleared and the last audio sample is shifted to the DA-converter through the latches L1DAC, L2DAC.

The exclusive-OR that controls the shift signals at LADC.11 prevents the "wrap-around" of the latest,
rightmost pixel samples of a horizontal scanline during frame grabbing. Without it these samples would
contribute to the leftmost pixel samples, because the latch of the AD-converter receives [...]
does the sample-and-hold circuit when the system clock is disabled (clock is low). The latch would
therefore shift old information at its inputs when the system clock is enabled again and the clock goes high
after a while, because the sample-and-hold gate opens only when the clock is high. The short extra pulse
through NOR1.1 and XOR3.11 at the restart of the system clock flushes this old information and reloads the
sample-and-hold circuit with information, i.e. a voltage, associated with the leftmost pixels. The time
constant of the sample-and-hold circuit is approximately 10 ns, which is sufficiently short for reloading a
reasonably accurate value for the first sample of a horizontal scanline, because the three-gate-delay pulse
of the differentiator has a duration of approximately 30 ns. The AD-converter and Gray [...] have
sufficient time to digitize this newly sampled value before the system clock goes high after being enabled.
So when the system clock becomes high, a proper first pixel value is shifted through the latch LADC into
video memory.

In this section the timing of the image processing unit is considered in more detail. The discussion will
therefore become more dependent on specific hardware choices. A closer look is needed at the image-to-
sound conversion operations that take place in parallel during each phase of the 2 MHz system clock,
giving 250 ns high (H) and low (L) levels. Time delay estimates are taken from manufacturer's data books
and data sheets, mostly from Texas Instruments. Mnemonic names refer to the circuit topology described in
an appendix, and to the circuit design described in the previous text. The process numbers 1 through 5 in
front of the sets of operations taking place in parallel indicate the route taken by a particular oscillator
sample, i.e. they indicate the stages in the pipeline. A sample starts at process 1, then proceeds to process
2 etc. Processes 1, 3 and 5 take place in parallel at clock = L, but they operate on different oscillator
samples. The same applies to processes 2 and 4 at clock = H.



Clock = L (250 ns)                                                Typical delay:

* Process 1:
  increment counters CNT1-3                                          70 ns
  put count to address DFI EPROMs (not via latch)                     0 ns
  use address to read DFI EPROMs                                    250 ns
  Total:                                                            320 ns

The reading of DFI continues during process 2, so a total delay > 250 ns is allowed for process 1. DFI will
need another 70 ns of the next clock phase H.



* Process 3:
  shift sum to FI SRAMs, latches L1FI, L2FI
    (old address, count not yet shifted)                             20 ns
  write sum into FI SRAMs (TI p.8-14)                               150 ns
  Total:                                                            170 ns

  & (simultaneously):
  shift brightness from PIX to address SIN1,2 EPROMs, L3FI      max(20 ns,
  & shift phase to exclusive ORs XOR1-3, latches L3FI, L4FI          20 ns)
  exclusive ORs give ones complement yes/no                          20 ns
  use result as address to read SIN1,2 EPROMs                       250 ns
  Total:                                                            290 ns

The reading of SIN continues during process 4, so a total delay > 250 ns is allowed for the second part of
process 3. SIN will need another 40 ns of the next clock phase H.



* Process 5:
  shift superposition sample, latches L1SIN, L2SIN, L3SIN            20 ns
  Total:                                                             20 ns

The 64th cycle is detected by observing the delayed trailing edge of the 6th bit of counter CNT1.



Clock = H (250 ns)                                                Typical delay:

* Process 2:
  shift count to address FI SRAMs, latches L1CNT, L2CNT              20 ns
  use address to read FI SRAMs                                  max(150 ns,
  & continue reading DFI EPROMs (process 1)                          70 ns)
  sum results FI and DFI in AD1-4                                    50 ns
  Total:                                                            220 ns

  & (simultaneously):
  shift count to address PIX1-4 SRAMs, latches L1CNT, L2CNT          20 ns
  use address to read PIX SRAMs                                     150 ns
  Total:                                                            170 ns

Both totals are less than the 250 ns clock phase available.



* Process 4:
  continue reading SIN EPROMs (process 3)                            40 ns
  add result to obtain superposition sample, ALU1-5                 100 ns
  Total:                                                            140 ns
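
The timing budget of the five processes can be checked by adding up the typical delays listed above. The sketch below only restates that arithmetic; the variable names are hypothetical, and the rule that an EPROM read started in the L phase may spill over into the following H phase is taken from the remarks accompanying processes 1 and 3.

```python
PHASE_NS = 250  # duration of each clock level at 2 MHz

# Clock = L
process1 = 70 + 0 + 250                  # count, address and read DFI EPROMs
process3_write = 20 + 150                # latch sum, write FI SRAMs
process3_sine = max(20, 20) + 20 + 250   # latch, exclusive-OR, read SIN EPROMs
process5 = 20                            # shift superposition sample

# Parts of the EPROM reads that continue into the next H phase.
dfi_tail = process1 - PHASE_NS           # 70 ns
sin_tail = process3_sine - PHASE_NS      # 40 ns

# Clock = H
process2 = 20 + max(150, dfi_tail) + 50  # read FI SRAMs, finish DFI, add
process2_pix = 20 + 150                  # read PIX SRAMs in parallel
process4 = sin_tail + 100                # finish SIN read, ALU addition

# Everything that must complete within one clock level.
within_one_phase = {
    "process 2": process2,
    "process 2 (PIX)": process2_pix,
    "process 3 (write)": process3_write,
    "process 4": process4,
    "process 5": process5,
}

if __name__ == "__main__":
    for name, delay in within_one_phase.items():
        print(f"{name:18s} {delay:4d} ns  slack {PHASE_NS - delay:4d} ns")
```

The tightest stage is process 2 with 30 ns of slack, which is why the phase change EPROMs receive their addresses one clock level earlier than the SRAMs.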



In the schematic below the main actions taking place near the 64 clock cycles boundary are indicated.
[...] Superposition is abbreviated to "sup". The numbers extending the words indicate the
associated oscillator, i.e. 0 through 63.

  Clock   Count   Actions
  H       63      Readphi63 / Readsin62(tail) / addsup62
  L       00      Count00 / Writephi63 / Shiftsup62
  H       00      Readphi00 / Readsin63(tail) / addsup63
  L       01      Count01 / Writephi00 / Shiftsup63 to DAC & clear sup
  H       01      Readphi01 / Readsin00(tail) / addsup00
  L       02      Count02 / Writephi01 / Shiftsup00
Claims

1. Transformation system for converting visual images into acoustical representations, characterized in that 
visual images are online transformed into acoustical representations substantially retaining relevant informa- 
tion content of the visual images. 

2. Transformation system as claimed in Claim 1 , 

characterized in that the system is provided with a low-weight low-power image sensing system. 

3. Transformation system as claimed in Claim 1 or Claim 2, 

characterized in that images sensed by an image sensing system are digitized and stored for subsequent 
digital processing. 

4. Transformation system as claimed in Claim 1 or Claim 3, 

characterized in that an image processing unit thereof is based on scanning images on a rectangular grid of 
pixels, each pixel having a particular brightness value from a discrete set of brightness values. 

5. Transformation system as claimed in Claim 4, 

characterized in that the rectangular grid contains 64x64 pixels with one out of 16 possible brightness 
values for every pixel. 

6. Transformation system as claimed in any one of the preceding Claims, characterized in that an image
processing unit thereof contains a pipelined digital processing system used to convert digitized images or
other similarly organized digital information into acoustical representations.

7. Transformation system as claimed in Claim 6, 

characterized in that a pipelined digital processing system is organized for a high degree of parallel 
processing to obtain real-time transformations. 

8. Transformation system as claimed in any one of the preceding Claims, characterized in that an image is
read out and processed scanline by scanline.

9. Transformation system as claimed in Claim 5 or Claim 8, 

characterized in that an image is read out and processed in 64 scanlines with 64 pixels on each scanline. 

10. Transformation system as claimed in Claims 1, 4, 5 or 9, 

characterized in that scanline position is represented in time sequence, pixel position on each scanline is 
represented in frequency and pixel brightness is represented in amplitude of an acoustical representation. 

11. Transformation system as claimed in Claim 6, 
characterized in that an image processing unit is portable and low-power for mobile battery-operated
applications.